Skip to contents

Coevolutionary Analysis

At this point, we’ve walked through the steps to take a set of sequences and obtain a set of COGs with phylogenetic reconstructions for each COG. We’re now ready to look for signals of coevolution, which imply functional associations between COGs. These methods are implemented via the ProtWeaver class in SynExtend, which includes many commonly used methods for detecting coevolutionary patterns.

While the previous steps have only utilized a small subsample of the data, we’re now finally going to work with the complete dataset. This dataset is comprised of the 91 Micrococcus genomes available with assemblies at the Scaffold, Chromosome, or Complete level (link to query). Note that more genomes may become available after this conference; these are all that were available at the time.

We ran the complete pipeline of identifying and annotating genes with DECIPHER, finding COGs with SynExtend, and then creating gene trees for each COG using DECIPHER. The complete data consist of 3,407 distinct COGs. All of this analysis is performed entirely within SynExtend and DECIPHER; no external packages or data are required aside from the input genomes.

We now use the new ProtWeaver class to try to find COGs that show evidence of correlated evolutionary selective pressures, also referred to as ‘coevolutionary signal’.

 

library(DECIPHER)
library(SynExtend)


COGsFile <- '/path/to/MicrococcusCOGs.RData'
AnnotationsFile <- '/path/to/MicrococcusAnnotations.RData'
TreesFile <- '/path/to/MicrococcusGeneTrees.RData'

# All of these files are formatted the same way, as lists with each entry
# corresponding to a COG. COGs entries are identifiers for each gene, CogsAnnot 
# entries are annotations, and CogTrees entries are gene trees
load(COGsFile)          # Loads 'COGs'
load(AnnotationsFile)   # Loads 'CogsAnnot'
load(TreesFile)         # Loads 'CogTrees'
# Let's look at one of the COGs
COGs[[7]]
##  [1] "1_1_919"     "12_44_1601"  "13_5_415"    "2_1_10"      "14_1_274"   
##  [6] "15_1_14"     "16_132_1767" "17_107_1177" "18_1_1825"   "19_1_1783"  
## [11] "20_1_958"    "21_223_1464" "22_44_1726"  "23_13_736"   "3_1_130"    
## [16] "24_11_767"   "25_1_915"    "26_8_258"    "27_53_1567"  "28_10_1132" 
## [21] "29_53_1900"  "30_1_72"     "31_7_808"    "32_7_458"    "33_183_1337"
## [26] "34_130_2076" "35_197_1194" "36_128_792"  "37_22_1142"  "38_81_1984" 
## [31] "4_1_9"       "39_1_1608"   "40_1_416"    "41_1_925"    "42_1_1032"  
## [36] "43_1_707"    "44_1_704"    "45_13_755"   "46_63_1530"  "47_1_970"   
## [41] "48_1_690"    "49_1_779"    "50_1_267"    "51_20_1253"  "52_1_971"   
## [46] "53_105_395"  "54_109_577"  "55_33_1488"  "56_11_533"   "57_161_2179"
## [51] "58_111_515"  "59_55_1658"  "5_1_9"       "60_50_1995"  "61_1_78"    
## [56] "62_8_204"    "63_155_1042" "64_47_712"   "65_96_2144"  "66_11_774"  
## [61] "67_66_1831"  "68_22_82"    "69_22_1206"  "70_34_781"   "71_96_2235" 
## [66] "72_1_1811"   "73_11_821"   "74_1_221"    "75_15_1093"  "76_1_2110"  
## [71] "77_1_696"    "78_1_1963"   "79_12_872"   "80_12_855"   "81_48_2136" 
## [76] "82_41_671"   "83_20_1289"  "84_185_549"  "85_42_1351"  "86_8_1238"  
## [81] "88_7_611"    "89_126_1611" "90_1_980"    "6_21_1244"   "7_40_2015"  
## [86] "8_1_348"     "9_5_575"     "10_1_382"    "11_1_10"     "11_12_1037" 
## [91] "91_1_985"
CogsAnnot[[7]]
##   A test set of class 'Taxa' with length 91
##      confidence name                 taxon
##  [1]       100% 50_1_267             Root; 09120 Genetic Information Processi...
##  [2]       100% 51_20_1253           Root; 09120 Genetic Information Processi...
##  [3]       100% 52_1_971             Root; 09120 Genetic Information Processi...
##  [4]        71% 53_105_395           Root; 09120 Genetic Information Processi...
##  [5]       100% 54_109_577           Root; 09120 Genetic Information Processi...
##  ...        ... ...                  ...
## [87]       100% 45_13_755            Root; 09120 Genetic Information Processi...
## [88]       100% 46_63_1530           Root; 09120 Genetic Information Processi...
## [89]       100% 47_1_970             Root; 09120 Genetic Information Processi...
## [90]       100% 48_1_690             Root; 09120 Genetic Information Processi...
## [91]       100% 49_1_779             Root; 09120 Genetic Information Processi...
CogsAnnot[[7]][[1]]
## $taxon
## [1] "Root"                                                             
## [2] "09120 Genetic Information Processing"                             
## [3] "09122 Translation"                                                
## [4] "03010 Ribosome [PATH:ko03010]"                                    
## [5] "K02899  RP-L27, MRPL27, rpmA, large subunit ribosomal protein L27"
## 
## $confidence
## [1] 100 100 100 100 100
CogTrees[[7]]
## 'dendrogram' with 2 branches and 91 members total, at height 0.3813559
# Note that tree labels correspond to gene identifiers
setequal(COGs[[7]], labels(CogTrees[[7]]))
## [1] TRUE

There is a ton of data here, and we unfortunately don’t have time to look at all of it. To demonstrate some of the things we can do with ProtWeaver, we’re going to look at subset of the data that is easier to investigate.

We’ll subset the COGs to ones that meet the following characteristics:

  • COGs with no paralogs
  • COGs not part of the core genome
  • COGs present in at least 5 genomes (not singletons or super rare ones)
  • COGs with at least one high confidence annotation
  • COGs that imply a coding region

These are relatively arbitrary requirements, but they subset the data to an example that runs quickly and is easily understandable. This essentially takes a group of COGs that are not essential to the organisms but tend to appear a lot, that have been characterized individually before.

## Subsetting COGs

# Cutoff values (91 genomes total!)
CoreCutoff <- 88
UnderCutoff <- 5

# Get assembly identifiers for each COG
truncCOGs <- lapply(COGs, \(x) sort(as.integer(gsub('([0-9]*)_.*', '\\1', x))))

# Find COGs without paralogs
noParas <- sapply(truncCOGs, \(x) length(x) == length(unique(x)))

# Get non-core genome
notCoreCOGs <- sapply(truncCOGs, \(x) length(unique(x)) < CoreCutoff)

# Get genes in more than 5 organisms
notSingles <- sapply(truncCOGs, \(x) length(unique(x)) > UnderCutoff)

# Make sure COGs are coding elements
codingCOGs <- sapply(CogsAnnot, \(x) is(x, 'Taxa'))

# At least one high confidence annotation
highConf <- sapply(CogsAnnot, \(x) 
                   if(is(x, 'Taxa')) 
                     max(sapply(x, \(y) 
                                y$confidence[length(y$confidence)])) > 50
                   else FALSE
                   )

# Subset for the workshop
WorkshopSubset <- noParas & notCoreCOGs & notSingles & codingCOGs & highConf

# Subset our data
WCogs <- COGs[WorkshopSubset]
WAnnots <- CogsAnnot[WorkshopSubset]
WTrees <- CogTrees[WorkshopSubset]

Let’s also put together a list of consensus annotations for each COG.

consAnnots <- vector('character', length=length(CogsAnnot))
for ( i in seq_along(CogsAnnot) ){
  taxaentry <- CogsAnnot[[i]]
  if (!is(taxaentry, 'Taxa'))
    consAnnots[[i]] <- 'NONCODING'
  else {
    annots <- sapply(taxaentry, \(y) y$taxon[length(y$taxon)])
    annots <- annots[annots!='unclassified_Root']
    if (length(annots) == 0)
      consAnnots[[i]] <- 'PNACT'
    else
      consAnnots[[i]] <- names(sort(table(annots), decreasing=T))[1]
  }
}

# Remove this when we get the dev branch of SynExtend
consAnnots <- consAnnots[WorkshopSubset]

Now we can make our ProtWeaver object. ProtWeaver has multiple input options, either a list formatted like COGs (list with gene identifiers) or a list like CogTrees (list with gene trees).

pw <- ProtWeaver(WTrees)

The ProtWeaver constuctor automatically detects the type of data you have and adjusts available predictors accordingly. While it functions best with a list of dendrograms for each COG, it can also run with simple presence/absence patterns. See the documentation file for ProtWeaver for more information on this functionality.

We’re now ready to make predictions. Predicting functional associations is done with the predict.ProtWeaver S3 method. Let’s examine possible functional associations between the COGs we have.

preds <- predict(pw)
print(preds)
## a ProtWeb object.
##  Method used: Ensemble 
##  Number of genes: 88 
##  Number of predictions: 3916

Viewing our results

Notice that preds is a ProtWeb object. This is just a simple S3 class with a pretty print method wrapping a matrix of pairwise association scores. We can get the raw data with GetProtWebData():

# Subset so the output is actually readable
GetProtWebData(preds)[1:7, 1:7]
##           1         2         3         4         5         6         7
## 1 1.0000000 0.3543835 0.2026004 0.2658768 0.2583244 0.2565627 0.2223818
## 2 0.3543835 1.0000000 0.1987633 0.3136938 0.2310440 0.3566941 0.2739222
## 3 0.2026004 0.1987633 1.0000000 0.2175542 0.2242302 0.2181276 0.1817777
## 4 0.2658768 0.3136938 0.2175542 1.0000000 0.2177652 0.2634777 0.2139104
## 5 0.2583244 0.2310440 0.2242302 0.2177652 1.0000000 0.3688554 0.2064071
## 6 0.2565627 0.3566941 0.2181276 0.2634777 0.3688554 1.0000000 0.1869671
## 7 0.2223818 0.2739222 0.1817777 0.2139104 0.2064071 0.1869671 1.0000000

The ProtWeb class will be updated next release cycle to include more methods, including a custom plotting function. The current plot.ProtWeb S3 method implements a force-directed embedding of the pairwise scores, but it’s a big work-in-progress. Stay tuned for the next release cycle for more functionality regarding ProtWeb.

In the meantime, we can use the igraph package to find clusters of coevolving COGs.

# Ensure reproducibility of igraph clustering
set.seed(123)

adjMatrix <- GetProtWebData(preds)
g <- graph_from_adjacency_matrix(adjMatrix, weighted=TRUE,
                                 mode='undirected', diag=FALSE)

clusters <- cluster_fast_greedy(g)

# Getting the clusters & identifying COGs by consensus annotation
clusterLabels <- vector('list', length(clusters))
for ( i in seq_along(clusterLabels) ){
  cluster <- communities(clusters)[[i]]
  labs <- consAnnots[as.integer(cluster)]
  clusterLabels[[i]] <- labs[order(sapply(labs, \(x) strsplit(x, ' ')[[1]][3]))]
}

# Let's examine a cluster
clusterLabels[[1]]
##  [1] "K02050  ABC.SN.P, NitT/TauT family transport system permease protein"                                          
##  [2] "K06320  cgeB, spore maturation protein CgeB"                                                                   
##  [3] "K15770  cycB, ganO, arabinogalactan oligomer / maltooligosaccharide transport system substrate-binding protein"
##  [4] "K00956  cysN, sulfate adenylyltransferase subunit 1 [EC:2.7.7.4]"                                              
##  [5] "K05499  cytR, LacI family transcriptional regulator, repressor for deo operon, udp, cdd, tsx, nupC, and nupG"  
##  [6] "K00301  E1.5.3.1, sarcosine oxidase [EC:1.5.3.1]"                                                              
##  [7] "K01500  fchA, methenyltetrahydrofolate cyclohydrolase [EC:3.5.4.9]"                                            
##  [8] "K02217  ftnA, ftn, ferritin [EC:1.16.3.2]"                                                                     
##  [9] "K15772  ganQ, arabinogalactan oligomer / maltooligosaccharide transport system permease protein"               
## [10] "K02440  GLPF, glycerol uptake facilitator protein"                                                             
## [11] "K00865  glxK, garK, glycerate 2-kinase [EC:2.7.1.165]"                                                         
## [12] "K06975  K06975, uncharacterized protein"                                                                       
## [13] "K07095  K07095, uncharacterized protein"                                                                       
## [14] "K07729  K07729, putative transcriptional regulator"                                                            
## [15] "K03719  lrp, Lrp/AsnC family transcriptional regulator, leucine-responsive regulatory protein"                 
## [16] "K23518  MACROD, ymdB, O-acetyl-ADP-ribose deacetylase [EC:3.1.1.106]"                                          
## [17] "K00549  metE, 5-methyltetrahydropteroyltriglutamate--homocysteine methyltransferase [EC:2.1.1.14]"             
## [18] "K03517  nadA, quinolinate synthase [EC:2.5.1.72]"                                                              
## [19] "K00767  nadC, QPRT, nicotinate-nucleotide pyrophosphorylase (carboxylating) [EC:2.4.2.19]"                     
## [20] "K05846  opuBD, osmoprotectant transport system permease protein"                                               
## [21] "K15866  paaG, 2-(1,2-epoxy-1,2-dihydrophenyl)acetyl-CoA isomerase [EC:5.3.3.18]"                               
## [22] "K00074  paaH, hbd, fadB, mmgB, 3-hydroxybutyryl-CoA dehydrogenase [EC:1.1.1.157]"                              
## [23] "K00156  poxB, pyruvate dehydrogenase (quinone) [EC:1.2.5.1]"                                                   
## [24] "K03343  puo, putrescine oxidase [EC:1.4.3.10]"                                                                 
## [25] "K02914  RP-L34, MRPL34, rpmH, large subunit ribosomal protein L34"                                             
## [26] "K12510  tadB, tight adherence protein B"                                                                       
## [27] "K12511  tadC, tight adherence protein C"                                                                       
## [28] "K01975  thpR, RNA 2',3'-cyclic 3'-phosphodiesterase [EC:3.1.4.58]"                                             
## [29] "K07260  vanY, zinc D-Ala-D-Ala carboxypeptidase [EC:3.4.17.14]"                                                
## [30] "K22736  VIT, vacuolar iron transporter family protein"                                                         
## [31] "K16870  wbbL, N-acetylglucosaminyl-diphospho-decaprenol L-rhamnosyltransferase [EC:2.4.1.289]"                 
## [32] "K19159  yefM, antitoxin YefM"                                                                                  
## [33] "K19158  yoeB, toxin YoeB [EC:3.1.-.-]"

Note here that we’ve identified all the COGs in this group to be associated based on coevolutionary pressures, and this conclusion is supported by annotation consistency across the members of this group. Of note here is that this clustering was done with no prior knowledge on what these COGs do; we reached this conclusion using purely sequencing data as input. This cluster is also very similar to results in the STRING database. While STRING doesn’t have the nixA, it does have Achl_0397, a similar nickle/cobalt transporter. If we had conducted this analysis with hypothetical proteins, we could have inferred novel functions!

Methods Implemented in ProtWeaver

By default, predict.ProtWeaver makes an ensemble prediction using as many individual models as it can run with the data provided. However, users are free to use any of the individual models without the ensemble predictor. The methods implemented are the following:

# PHYLOGENETIC PROFILING METHODS:
  ## P/A = Presence/Absence Profiles
  ## Jaccard distance of P/A
Jaccard <- predict(pw, method='Jaccard') 

  ## Hamming distance of P/A
Hamming <- predict(pw, method='Hamming') 

  ## MutualInformation of P/A
MutualInf <- predict(pw, method='MutualInformation')

  ## Direct Coupling Analysis of P/A
ProfDCA <- predict(pw, method='ProfDCA') 

  ## Correlation of gain/loss events on phylogeny, requires Species Tree
Behdenna <- predict(pw, method='Behdenna', mySpeciesTree=exSpeciesTree)

# CO-LOCALIZATION METHODS:
Colocalization <- predict(pw, method='Coloc') # Co-localization analysis

# DISTANCE MATRIX METHDOS:
MirrorTree <- predict(pw, method='MirrorTree')
ContextTree <- predict(pw, method='ContextTree')

# Residue Methods: (ONLY AVAILABLE IN DEV VERSION)
#   ## MutualInf of residues
ResidueMI <- predict(pw, method='ResidueMI')